<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Master leases</title>
    <link rel="stylesheet" href="gettingStarted.css" type="text/css" />
    <meta name="generator" content="DocBook XSL Stylesheets V1.73.2" />
    <link rel="start" href="index.html" title="Berkeley DB Programmer's Reference Guide" />
    <link rel="up" href="rep.html" title="Chapter 13.  Berkeley DB Replication" />
    <link rel="prev" href="rep_trans.html" title="Transactional guarantees" />
    <link rel="next" href="rep_ryw.html" title="Read your writes consistency" />
  </head>
  <body>
    <div xmlns="" class="navheader">
      <div class="libver">
        <p>Library Version 12.1.6.2</p>
      </div>
      <table width="100%" summary="Navigation header">
        <tr>
          <th colspan="3" align="center">Master leases</th>
        </tr>
        <tr>
          <td width="20%" align="left"><a accesskey="p" href="rep_trans.html">Prev</a> </td>
          <th width="60%" align="center">Chapter 13.  Berkeley DB Replication </th>
          <td width="20%" align="right"> <a accesskey="n" href="rep_ryw.html">Next</a></td>
        </tr>
      </table>
      <hr />
    </div>
    <div class="sect1" lang="en" xml:lang="en">
      <div class="titlepage">
        <div>
          <div>
            <h2 class="title" style="clear: both"><a id="rep_lease"></a>Master leases</h2>
          </div>
        </div>
      </div>
      <div class="toc">
        <dl>
          <dt>
            <span class="sect2">
              <a href="rep_lease.html#masterlease_change_groupsize">Changing group
            size</a>
            </span>
          </dt>
        </dl>
      </div>
      <p> 
        Some applications have strict requirements about the
        consistency of data read on a master site. Berkeley DB
        provides a mechanism called master leases to provide such
        consistency. Without master leases, it is sometimes possible
        for Berkeley DB to return old data to an application when
        newer data is available due to unfortunate scheduling as
        illustrated below: 
    </p>
      <div class="orderedlist">
        <ol type="1">
          <li><span class="bold"><strong>Application on master
                site</strong></span>: Read data item
                <span class="emphasis"><em>foo</em></span> via Berkeley DB <a href="../api_reference/C/dbget.html" class="olink">DB-&gt;get()</a> or
            <a href="../api_reference/C/dbcget.html" class="olink">DBC-&gt;get()</a> call. 
        </li>
          <li><span class="bold"><strong>Application on master
                site</strong></span>: sleep, get descheduled, etc.
        </li>
          <li><span class="bold"><strong>System</strong></span>: Master changes
            role, becomes a client. 
        </li>
          <li><span class="bold"><strong>System</strong></span>: New site is
            elected master.
        </li>
          <li><span class="bold"><strong>System</strong></span>: New master
            modifies data item <span class="emphasis"><em>foo</em></span>.
        </li>
          <li><span class="bold"><strong>Application</strong></span>: Berkeley DB
            returns old data for <span class="emphasis"><em>foo</em></span> to
            application.
        </li>
        </ol>
      </div>
      <p> 
        By using master leases, Berkeley DB can provide guarantees
        about the consistency of data read on a master site. The
        master site can be considered a recognized authority for the
        data and consequently can provide authoritative reads. Clients
        grant master leases to a master site. By doing so, clients
        acknowledge the right of that site to retain the role of
        master for a period of time. During that period of time,
        clients cannot elect a new master, become master, or grant
        their lease to another site.
    </p>
      <p> 
        By holding a collection of granted leases, a master site
        can guarantee to the application that the data returned is the
        current, authoritative value. As a master performs operations,
        it continually requests updated grants from the clients. When
        a read operation is required, the master guarantees that it
        holds a valid collection of lease grants from clients before
        returning data to the application. By holding leases, Berkeley
        DB provides several guarantees to the application: 
    </p>
      <div class="orderedlist">
        <ol type="1">
          <li>
            Authoritative reads: A guarantee that the data
            being read by the application is the current value. 
        </li>
          <li>
            <p>
                Durability from rollback: A guarantee that the data
                being written or read by the application is permanent
                across a majority of sites and will never be
                rolled back. 
            </p>
            <p>
                The rollback guarantee also depends on the
                <a href="../api_reference/C/envset_flags.html#envset_flags_DB_TXN_NOSYNC" class="olink">DB_TXN_NOSYNC</a> flag. The guarantee is effective as
                long as there isn't a failure of half of the
                replication group while clients have granted leases
                but are holding the updates in their cache. The
                application must weigh the performance impact of
                synchronous transactions against the risk of the
                failure of at least half of the replication group. If
                clients grant a lease while holding updated data in
                cache, and failure occurs, then the data is no longer
                present on the clients and rollback can occur if a
                sufficient number of other sites also crash. 
            </p>
            <p> 
                The guarantee that data will not be rolled back
                applies only to data successfully committed on a
                master. Data read on a client, or read while ignoring
                leases can be rolled back. 
            </p>
          </li>
          <li>
            <p> 
                Freshness: A guarantee that the data being read by
                the application on the <span class="emphasis"><em>master</em></span> is
                up-to-date and has not been modified or removed during
                the read. 
            </p>
            <p>
                The read authority is only on the master. Read
                operations on a client always ignore leases and
                consequently, these operations can return stale data.
            </p>
          </li>
          <li>
            <p> 
                Master viability: A guarantee that a current master
                with valid leases cannot encounter a duplicate master
                situation.
            </p>
            <p> 
                Leases remove the possibility of a duplicate master
                situation that forces the current master to downgrade
                to a client. However, it is still possible that old
                masters with expired leases can discover a later
                master and return <a href="../api_reference/C/repmessage.html#repmsg_DB_REP_DUPMASTER" class="olink">DB_REP_DUPMASTER</a> to the
                application. 
            </p>
          </li>
        </ol>
      </div>
      <p>
        There are several requirements of the application using
        leases:
    </p>
      <div class="orderedlist">
        <ol type="1">
          <li> 
            Replication Manager applications must configure a
            majority (or larger) acknowledgement policy via the
            <a href="../api_reference/C/repmgrset_ack_policy.html" class="olink">DB_ENV-&gt;repmgr_set_ack_policy()</a> method. Base API applications must
            implement and enforce such a policy on their own. 
        </li>
          <li> 
            Base API applications must return an error from the
            send callback function when the majority acknowledgement
            policy is not met for permanent records marked with
            <a href="../api_reference/C/reptransport.html#transport_DB_REP_PERMANENT" class="olink">DB_REP_PERMANENT</a>. Note that the Replication Manager
            automatically fulfills this requirement. 
        </li>
          <li> 
            Base API applications must set the number of sites
            in the group using the <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV-&gt;rep_set_nsites()</a> method before starting
            replication and cannot change it during operation. 
        </li>
          <li> 
            Using leases in a replication group is all or none.
            Behavior is undefined when some sites configure leases and
            others do not. Use the <a href="../api_reference/C/repconfig.html" class="olink">DB_ENV-&gt;rep_set_config()</a> method to turn on
            leases.
        </li>
          <li>
            The configured lease timeout value must be the same
            on all sites in a replication group, set via the
            <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV-&gt;rep_set_timeout()</a> method. 
        </li>
          <li> 
            The configured clock skew ratio must be the same on
            all sites in a replication group. This value defaults to
            no skew, but can be set via the <a href="../api_reference/C/repclockskew.html" class="olink">DB_ENV-&gt;rep_set_clockskew()</a> method. 
        </li>
          <li> 
            Applications that care about read guarantees must
            perform all read operations on the master. Reading on a
            client does not guarantee freshness. 
        </li>
          <li>
            The application must use elections to choose a
            master site. It must never simply declare a master without
            having won an election (as is allowed without Master
            Leases). 
        </li>
          <li>
            Unelectable (zero priority) sites never grant
            leases and cannot be used to guarantee data durability. A
            majority of sites in the replication group must be
            electable in order to meet the requirement of getting
            lease grants from a majority of sites. Minimizing the
            number of unelectable sites improves replication group
            availability.
        </li>
        </ol>
      </div>
      <p> 
        Master leases are based on timeouts. Berkeley DB assumes
        that time always runs forward. Users who change the system
        clock on either client or master sites when leases are in use
        void all guarantees and can get undefined behavior. See the
        <a href="../api_reference/C/repset_timeout.html" class="olink">DB_ENV-&gt;rep_set_timeout()</a> method for more information.
    </p>
      <p>
        Applications using master leases should be prepared to
        handle <code class="literal">DB_REP_LEASE_EXPIRED</code> errors from
        read operations on a master and from the <a href="../api_reference/C/txncommit.html" class="olink">DB_TXN-&gt;commit()</a> method.
    </p>
      <p> 
        Read operations on a master that should not be subject to
        leases can use the <a href="../api_reference/C/dbget.html#get_DB_IGNORE_LEASE" class="olink">DB_IGNORE_LEASE</a> flag to the <a href="../api_reference/C/dbget.html" class="olink">DB-&gt;get()</a>
        method. Read operations on a client always imply leases are
        ignored. 
    </p>
      <p> 
        Master lease checks cannot succeed until a majority of
        sites have completed client synchronization. Read operations
        on a master performed before this condition is met can use the
        <a href="../api_reference/C/dbget.html#get_DB_IGNORE_LEASE" class="olink">DB_IGNORE_LEASE</a> flag to avoid errors. 
    </p>
      <p>
        Clients are forbidden from participating in elections while
        they have an outstanding lease granted to a master. Therefore,
        if the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a> method is called, then Berkeley DB will
        block, waiting until its lease grant expires before
        participating in any election. While it waits, the client
        attempts to contact the current master. If the client finds a
        current master, then it returns from the <a href="../api_reference/C/repelect.html" class="olink">DB_ENV-&gt;rep_elect()</a> method.
        When leases are configured and the lease has never yet been
        granted (on start-up), clients must wait a full lease timeout
        before participating in an election. 
    </p>
      <div class="sect2" lang="en" xml:lang="en">
        <div class="titlepage">
          <div>
            <div>
              <h3 class="title"><a id="masterlease_change_groupsize"></a>Changing group
            size</h3>
            </div>
          </div>
        </div>
        <p>
            If you are using master leases and you change the size
            of your replication group, there is a remote possibility
            that you can lose some data previously thought to be
            durable. This is only true for users of the Base API.
        </p>
        <p>
            The problem can arise if you are removing sites from
            your replication group. (You might be increasing the size
            of your group overall, but if you remove all of the wrong
            sites you can lose data.) 
        </p>
        <p> 
            Suppose you have a replication group with five sites;
            A, B, C, D and E; and you are using a quorum
            acknowledgement policy. Then:
        </p>
        <div class="orderedlist">
          <ol type="1">
            <li>
              <p> 
                    Master A replicates a transaction to replicas B
                    and C. Those sites acknowledge the write activity.
                </p>
            </li>
            <li>
              <p> 
                    Sites D and E do not receive the transaction.
                    However, B and C have acknowledged the
                    transaction, which means the acknowledgement
                    policy is met and so the transaction is considered
                    durable. 
                </p>
            </li>
            <li>
              <p>
                    You shut down sites B and C. Now only A has the
                    transaction. 
                </p>
            </li>
            <li>
              <p>
                    You decrease the size of your replication group
                    to 3 using <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV-&gt;rep_set_nsites()</a>.
                </p>
            </li>
            <li>
              <p>
                    You shut down or otherwise lose site A. 
                </p>
            </li>
            <li>
              <p> 
                    Sites D and E hold an election. Because the
                    size of the replication group is 3, they have
                    enough sites to successfully hold an election.
                    However, neither site has the transaction in
                    question. In this way, the transaction can become
                    lost.
                </p>
            </li>
          </ol>
        </div>
        <p>
            An alternative scenario exists where you do not change
            the size of your replication group, or you actually
            increase the size of your replication group, but in the
            process you happen to remove the exact wrong sites: 
        </p>
        <div class="orderedlist">
          <ol type="1">
            <li>
              <p>
                    Master A replicates a transaction to replicas B
                    and C. Those sites acknowledge the write activity.
                </p>
            </li>
            <li>
              <p> 
                    Sites D and E do not receive the transaction.
                    However, B and C have acknowledged the
                    transaction, which means the acknowledgement
                    policy is met and so the transaction is considered
                    durable.
                </p>
            </li>
            <li>
              <p> 
                    You shut down sites B and C. Now only A has the
                    transaction. 
                </p>
            </li>
            <li>
              <p>
                    You add three new sites to your replication
                    group: F, G and H, increasing the size of your
                    replication group to 6 using <a href="../api_reference/C/repnsites.html" class="olink">DB_ENV-&gt;rep_set_nsites()</a>. 
                </p>
            </li>
            <li>
              <p>
                    You shut down or otherwise lose site A before
                    F, G and H can be fully populated with data.
                </p>
            </li>
            <li>
              <p>
                    Sites D, E, F, G and H hold an election.
                    Because the size of the replication group is 6,
                    they have enough sites to successfully hold an
                    election. However, none of these sites has the
                    transaction in question. In this way, the
                    transaction can become lost. 
                </p>
            </li>
          </ol>
        </div>
        <p>
            This scenario represents a race condition that would be
            highly unlikely to be seen outside of a lab environment.
            To minimize the chance of this race condition occurring to
            the absolute minimum, do one or more of the following when
            using master leases with the Base API:
        </p>
        <div class="orderedlist">
          <ol type="1">
            <li>
              <p>
                    Require all sites to acknowledge transaction
                    commits. 
                </p>
            </li>
            <li>
              <p>
                    Never change the size of your replication group
                    unless all sites in the group are running and
                    communicating normally with one another. 
                </p>
            </li>
            <li>
              <p> 
                    Don't remove (or replace) a large percentage of
                    your sites from your replication group unless all
                    sites in the group are running and communicating
                    normally with one another. If you are going to
                    remove a large percentage of your sites from your
                    replication group, try removing just one site at a
                    time, pausing in between each removal to give the
                    replication group a chance to fully distribute all
                    writes before removing the next site. 
                </p>
            </li>
          </ol>
        </div>
      </div>
    </div>
    <div class="navfooter">
      <hr />
      <table width="100%" summary="Navigation footer">
        <tr>
          <td width="40%" align="left"><a accesskey="p" href="rep_trans.html">Prev</a> </td>
          <td width="20%" align="center">
            <a accesskey="u" href="rep.html">Up</a>
          </td>
          <td width="40%" align="right"> <a accesskey="n" href="rep_ryw.html">Next</a></td>
        </tr>
        <tr>
          <td width="40%" align="left" valign="top">Transactional guarantees </td>
          <td width="20%" align="center">
            <a accesskey="h" href="index.html">Home</a>
          </td>
          <td width="40%" align="right" valign="top"> Read your writes consistency</td>
        </tr>
      </table>
    </div>
  </body>
</html>
